Introduction

This is an analysis of US presidential elections data for 2016 at the county level. Since only a small percentage of votes went to independent candidates, we will only compare Democrat and Republican voteshare.

The data for this analysis is taken from https://github.com/tonmcg/County_Level_Election_Results_12-16.

Data import and checking

Library imports:

library(dplyr)
library(ggplot2)
library(knitr)
library(readr)

Read in the data:

df <- read_csv("2016_US_County_Level_Presidential_Results.csv")
kable(head(df))
X1 votes_dem votes_gop total_votes per_dem per_gop diff per_point_diff state_abbr county_name combined_fips
0 93003 130413 246588 0.3771595 0.52887 37410 15.17% AK Alaska 2013
1 93003 130413 246588 0.3771595 0.52887 37410 15.17% AK Alaska 2016
2 93003 130413 246588 0.3771595 0.52887 37410 15.17% AK Alaska 2020
3 93003 130413 246588 0.3771595 0.52887 37410 15.17% AK Alaska 2050
4 93003 130413 246588 0.3771595 0.52887 37410 15.17% AK Alaska 2060
5 93003 130413 246588 0.3771595 0.52887 37410 15.17% AK Alaska 2068

There are 3,141 rows in total, matching the number of counties in the US (Source: http://www.snopes.com/trump-won-3084-of-3141-counties-clinton-won-57/ and http://www.wnd.com/2016/12/trumps-landslide-2623-to-489-among-u-s-counties/).

The dataset contains the following columns:

names(df)
##  [1] "X1"             "votes_dem"      "votes_gop"      "total_votes"   
##  [5] "per_dem"        "per_gop"        "diff"           "per_point_diff"
##  [9] "state_abbr"     "county_name"    "combined_fips"

combined_fips is the column that we are going to use to combine our elections data with our mapping data.

Since we are interested in whether a given county had more Republican or Democrat votes, we have to recompute the diff and per_point_diff columns. diff and per_point_diff will be positive if there are more Republican votes than Democrat votes (and vice versa).

df <- df %>% mutate(diff = votes_gop - votes_dem,
                    per_point_diff = diff / total_votes * 100)

Summary statistics

Compute percentage of popular vote won by each party:

paste0("Republican % of popular vote: ", 
       round(sum(df$votes_gop) / sum(df$total_votes) * 100, digits = 1),
       "%")
## [1] "Republican % of popular vote: 47.3%"
paste0("Democrat % of popular vote: ", 
       round(sum(df$votes_dem) / sum(df$total_votes) * 100, digits = 1),
       "%")
## [1] "Democrat % of popular vote: 47.5%"

Although Clinton lost the presidential election, she actually won the popular vote!

Compute number of counties won by each party:

df %>% transmute(gop_won = votes_gop > votes_dem) %>%
    summarize(gop_won = sum(gop_won))
## # A tibble: 1 x 1
##   gop_won
##     <int>
## 1    2654

Painting a completely different picture, Trump won 2654 out of 3141 counties (or 84.5% of all counties). Clinton only won 487 counties. This suggests that Clinton won in counties with large populations, or that the margin of victory was slimmer in the counties that Trump won compared with the counties that Clinton won.

Histograms

We have Clinton winning the popular vote on one hand, but Trump winning many more counties. How can we reconcile these two facts?

One theory is that Clinton won her counties by a huge margin percentage-wise, while Trump won his counties by a slim margin percentage-wise. To test this theory, we could plot a histogram of the per_point_diff:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = per_point_diff)) + 
    labs(title = "Histogram of % vote margin", 
         x = "% Republicans won by", y = "Frequency")

The chart does not support the theory that Trump had narrower margins of victory in the counties that he won: he won a sizeable number of counties with > 50% vote difference.

Let’s try plotting a histogram of diff to look at absolute differences instead:

ggplot() +
    geom_histogram(data = df, mapping = aes(x = diff)) + 
    labs(title = "Histogram of absolute vote margin", 
         x = "No. of votes Republicans won by", y = "Frequency")

This chart is very different! In the counties that Clinton won, she won it by extremely large margins in terms of absolute votes. Thus, even though she won very few counties compared to Trump, these large margins meant that she could actually win the popular vote.

The code below shows that the top 45 counties with largest absolute vote difference were all won by Clinton (number 46 was Montgomery, TX, which went to Trump).

df %>% select(State = state_abbr, County = county_name, diff) %>%
    mutate(abs_diff = abs(diff)) %>%
    arrange(desc(abs_diff)) %>%
    select(State, County, `Vote difference` = diff) %>%
    head(n = 50) %>%
    kable()
State County Vote difference
CA Los Angeles County -1273485
IL Cook County -1088369
NY Kings County -461433
WA King County -459368
NY New York County -456546
PA Philadelphia County -455124
CA Alameda County -395162
CA Santa Clara County -346020
NY Queens County -334839
MA Middlesex County -292756
FL Miami-Dade County -289340
MI Wayne County -288934
FL Broward County -288435
MD Prince George’s County -284337
NY Bronx County -283979
CA San Francisco County -277950
DC District of Columbia -248670
MN Hennepin County -237515
MD Montgomery County -226776
OR Multnomah County -208699
OH Cuyahoga County -204080
TX Dallas County -196980
VA Fairfax County -196648
GA DeKalb County -191600
MA Suffolk County -191170
CA Contra Costa County -180839
CA San Diego County -180436
TX Travis County -179725
GA Fulton County -171503
NJ Essex County -168972
CA San Mateo County -165849
WI Milwaukee County -162895
TX Harris County -161511
MD Baltimore city -155836
WI Dane County -146236
OH Franklin County -143633
NC Mecklenburg County -137955
FL Orange County -134488
CO Denver County -130974
NY Westchester County -124027
CA Sacramento County -110744
LA Orleans Parish -109566
MN Ramsey County -106151
PA Allegheny County -105529
NC Wake County -104746
TX Montgomery County 104444
NJ Hudson County -104365
FL Palm Beach County -100649
MA Norfolk County -99958
TN Shelby County -91692

Optional material: An alternative to histograms

We didn’t go through this section in class as it uses ggplot in a slightly more involved way. Take a look at the code and see if you can figure out what’s going on.

In the chart below, each vertical bar represents one county.

temp <- df %>% arrange(desc(per_point_diff))
ggplot(data = temp) + 
    geom_col(mapping = aes(x = as.integer(row.names(temp)), 
                           y = per_point_diff,
                           fill = per_point_diff)) +
    scale_fill_gradient2(low = "blue", high = "red") + 
    labs(title = "% Difference in votes received by county", 
         x = "", y = "% Difference") + 
    theme(legend.position = "right")

The chart above does not support the theory that Trump had narrower margins of victory in the counties that he won, as compared to Clinton.

Let’s make the same chart, but with absolute difference in votes received (instead of percentage difference):

temp <- df %>% 
    select(diff) %>%
    arrange(desc(diff))
ggplot(data = temp) + 
    geom_col(mapping = aes(x = as.integer(row.names(temp)), 
                           y = diff,
                           fill = diff)) +
    scale_fill_gradient2(low = "blue", high = "red") + 
    labs(title = "Absolute difference in votes received by county", 
         x = "", y = "Absolute difference")

Conclusion

When analyzing elections, we have to examine the data from many different perspectives in order to get the full story.